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Abstract: Inference based on the penalized density ratio model is pro- 
posed and studied. The model under consideration is specified by assum- 
ing that the log-likelihood function of two unknown densities is of some 
parametric form. The model has been extended to cover multiple samples 
problems while its theoretical properties have been investigated using large 
sample theory. A main application of the density ratio model is testing 
whether two, or more, distributions are equal. We extend these results by 
arguing that the penalized maximum empirical likelihood estimator has less 
^J ' mean square error than that of the ordinary maximum likelihood estima- 

(-H ^ tor, especially for small samples. In fact, penalization resolves any existence 

.^.j , problems of estimators and a modified Wald type test statistic can be em- 

C^ 1 ployed for testing equality of the two distributions. A limited simulation 

study supports further the theory. 
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1. Introduction 
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^r~ • The density ratio model is specified by assuming that the log-ratio of two un- 

f-^ I known probability density functions is linear in some parameters, see 

Qin and Zhang (1997) and Qin (1998). The model is motivated by consider- 
ing a logistic regression model for a binary random variable Y, which assumes 
S^ . the values 1 and 2-where "1" denotes success-and X a d-dimensional vector of 

H I covariates, see Anderson (1972, 1979), Breslow and Day (1980), Cox and Snell 

(1989), for example. Then, the logistic regression model expresses the probabil- 
ity of the event {Y = 1} as a function of X by 

exp (a* + f3'x) 



P[Y ^l\x] 



1 + exp (a* -t- f3'x) ' 



where a* is a scalar parameter and /3 is a d x 1 vector of regression coeffi- 
cients. The logistic model leads to the density ratio model when considering 



* Part of this work was carried out while the author was visiting the Department of Statis- 
tics, University of Munich. Discussions and support by L. Fahrmeir and K. Strimmer are 
greatly acknowledged. The comments of a reviewer improved the presentation. 
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case-control or retrospective sampling, see Prentice and Pyke (1979), Farewell 
(1979). Suppose that Xii,...,Xi„j is a random sample from G{x \ Y = 1) 
and X2i,...,X2„2 is another independent sample from G{x \ Y = 2). Set 
TT = P [y = 1] and g{x \ Y = i) = dG{x \ Y = i)/dx, for the conditional 
probability density function of X given Y = i, i = 1,2. Bayes' theorem shows 
that 

/ I V ^ = exp a +(3x) 

g{x\Y = 2} TT 

= exp {a + (3'x) , 

with a = a* + log{(l — 7r)/7r}. The last equation justifies the term density 
ratio model: the densities of the observations are related by a parametric expo- 
nential tilt, but otherwise are unknown. In general, consider the following two 
independent samples semiparametric problem: 

Xii,...,Xi„j is a random sample from gi{x) ~ cxjp [a + f3' h{x)) g2{x) , 
X21, . . . , X2n2 is a random sample from 92(2;), 

(1) 
where gi{x), i = 1,2 are unknown probability density functions. The quantity a 
is an unknown scalar, f3 is a, d- dimensional vector of parameters while h{x) is a 
d-dimensional vector which consists of known functions of X . Several examples 
of distribution fall within the above framework, in particular the exponential 
family of distributions satisfies (1) trivially. An important observation is that 
when the model holds, and if /? = 0, then the two samples are identically dis- 
tributed. We conclude that model (1) is useful to the semiparametric comparison 
of two samples in the sense that the densities gi{.), i = 1,2 are left completely 
unspecified but the weight function exp (a -|- (3'h(x)) depends on some finite 
dimensional parameter. The last remark connects the density ratio model and 
biased sampling theory, see Vardi (1982, 1985), Gill et al. (1988), Gilbert et al. 
(1999) and Gilbert (2000). 

Inference regarding both finite and infinite dimensional parameters of model 
(1) has been studied by various authors assuming that the sample sizes tend 
to infinity in a suitable way. The methodology is based on empirical likelihood 
inference, see Owen (2001). Accordingly, a parametric likelihood function for 
the finite dimensional parameters is obtained after profiling out the infinite di- 
mensional parameter of the model. However, there are applications where the 
sample sizes are small and therefore direct application of the empirical likelihood 
methodology might not be suitable. To overcome these problems we put forward 
a penalized empirical likelihood function for inference, see (4), which depends 
on two forms, a regularization parameter and a penalty function. The regular- 
ization parameter controls the amount of penalization whereas the penalty term 
is a function of the finite dimensional parameter of the model. The choice of the 
penalty function is independent of the infinite dimensional parameter as it will 
be explained in Section 2 where the methodology is developed in detail. Using 
penalized empirical likelihood leads to solution of several problems including 
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elimination of non existence and non convergence issues. In addition approxi- 
mations even for relatively small sample sizes are adequate enough to allow for 
testing. Section 3 reviews several facts about the resulting estimators and shows 
that a judicious choice of the regularization parameter leads to consistent and 
asymptotically normal estimators. The last section examines the finite sample 
performance of the proposed estimators by some limited simulations. The paper 
closes with several remarks and an appendix. 

2. Penalized likelihood inference 

Suppose that Xij, j ~ 1, . . . , n^, i = 1,2 are two independent random samples 
from cumulative distribution functions Gi{x), i = 1,2, respectively. Let gi{x) be 
the corresponding density functions, that is gi{x) = dGi{x)/ dx, and denote by 
X = (Xii,Xi2, . . . ,Xi„j,X2i, . . . ,-''^2712) the vector of all observations. Suppose 
further that the density ratio model (1) holds. Inference for the finite dimen- 
sional parameters 9 = {a, (3')' is based on the so-called empirical likelihood-see 
Owen (2001). Accordingly, let Pij denote the size of the jump at the observed 
datum Xij, that is Pij = dG2{xij) = G2{xfj) - G'2(x~), j = 1, 2, . . . , n^, i = 1, 2 
and consider the following nonparametric likelihood given the data, 

L(a,/3,G2|x) = Jnexp(a + /3'Ma;i,))rfG2(xi,)U J];rfG2(x2j)l 

{2 n, If"! 1 

nnp^4 n°^p("+/5'M^i^)) ■ (2) 

Following Qin and Zhang (1997) and Fokianos et al. (2001), elimination of 
the infinite dimensional parameter G2(.) is accomplished by maximizing 
the first product of (2) subject to the constraints X]i=iEi=i-Py — 1' ^^^ 
J2i=i X]?=i Pij (cxp(a + P'h{xij ) — 1)) = 0. Avoiding unnecessary repetition, the 
empirical log-likelihood is given by 

2 Ui ni 

1=1 j=i j=i 

where pi = ni/n2. Furthermore, ii 9 ~ {a, 13)' denotes the maximum likelihood 
estimator of 9, assuming that it exists, then it can be shown that 

1 1 

Pij = ■ 

"2 l-l-piexp(d-|-/3'/i(a;,j)) 

Hence a consistent estimator for both of Gi(.) and G2(.) can be constructed 
provided that the total sample size n = m + n2 tends to infinity such that 
ni/n2 -^ pi-see Qin and Zhang (1997) and Fokianos et al. (2001) for further 
details. 
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Expression (3), after reparametrization, is equivalent to standard logistic re- 
gression likelihood-a direct consequence of the equivalence between retrospec- 
tive and prospective sampling, see Prentice and Pykc (1979). Accordingly, infer- 
ence for the parameter 9 is carried out by numerous statistical programs which 
include logistic regression modeling. However, it is well known that if the sample 
size is large, then the logistic likelihood equations yield the maximum likelihood 
estimator provided that it exists. In this case, the score equations are solved 
by the Fisher scoring method and under some reasonable regularity assump- 
tions a sequence of approximations is derived that converges to the maximum 
likelihood estimators of a and (3. In contrast, convergence and existence issues 
arise when the sample size is relatively small. When the sample size is small, 
maximum likelihood estimates fail to exist in general if the data are completely, 
quasi-completely or partially separated. More specifically, there does not exist 
a maximum likelihood estimator of j3 if there exists a hyperplane in the co- 
variate space such that when F = 1, the covariates belong to the left side of 
the hyperplane whereas for F = 2, the covariates belong to the other side of 
the hyperplane (see Albert and Anderson (1984), McCuUagh and Nelder (1989, 
Sec. 4.4) and Santncr and Duffy (1989, p. 234)). The discussion shows that ap- 
plication of the density ratio model is questionable when only a few observations 
belong to each group. 

However, regularization of the log -likelihood function (3), in the sense of 
adding a concave penalty term, avoids existence and convergence issues, see 
also Hastie and Tibshirani (2004). Consider the so-called penalized empirical 
log-likelihood function 

lp{e) = m - A„ J(/3). (4) 

where l{0) denotes the unrestricted log-likelihood function given by (3), A„ is 
a sequence of regularization parameter controlling the amount of shrinkage and 
J(.) is a penalty function on the parameter (3. The parameter a-the intercept- 
is not penalized explicitly because it is a function of /3 when (1) holds. It is 
well known that when A„ —t then (4) yields to the unrestricted maximum 
likelihood estimator whereas if A„ — > oo then 6 shrinks towards 0. 

We argue that it is reasonable to employ the profile empirical likelihood for 
inference. Expression (4) is not a proper log-likelihood function in the sense 
that the first summand, namely the quantity l{0), is the outcome of a pro- 
filing procedure derived by means of empirical likelihood methodology. How- 
ever, recent work in the area of semiparametric statistical inference shows that 
such functions share the same properties of common likelihood functions, see 
Murphy and van der Vaart (2000). Thus, it is sensible to penalize (3) by in- 
troducing an extra concave term which does not depend on the distribution 
function G2. The method resolves successfully the existence problem of the max- 
imum likelihood estimator of 9 and yields estimates of the unknown distribution 
functions even for relatively small samples. 

Motivation of penalized empirical likelihood (4) stems also from a Bayesian 
point of view. Suppose that the density ratio model (1) holds conditionally on 
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(3 and suppose that /? is a random variable with a density proportional to 

7r(/?)cxexp(-A„J(/3)). 

The above display points out to a crucial point, namely the independence be- 
tween the prior of /? and the cdf G2 ■ It is precisely this fact that enables to base 
inference on the posterior likelihood function 

^(/3|G2,x)ocL(a,/3,G2|x)7r(/3). 

By setting lp{9) = log7r(/3|G2,x) we obtain the penalized log-likelihood ex- 
pression (4). Indeed, maximization of the last display with respect to p^'s is 
equivalent to maximization of (2) since the second factor does not depend on 
the infinite dimensional parameter. That is the penalized log-likelihood function 
still produces valid cdf estimators that satisfy the constraints imposed by the 
density ratio model. We conclude that maximization of 7r(/3|G2, x) has no effect 
on the final form of the profile log-likelihood function as long as the designated 
function of (3 that multiplies (2) does not depend on G2 . Note that the method- 
ology is quite general and it is rather interesting to study its properties for a 
wider class of examples associated with empirical likelihood theory. 

The penalized log-likelihood equations (4) depends on the choice of regular- 
ization parameter and the penalty term J(/3). The selection of A„ can be based 
on existing cross-validation methods but it is unclear what are the properties 
of such an approach especially when few observations are available. There are 
various approaches to choose the penalty function-perhaps the most popular 
being J(/3) = J2j=i P'j leading to the so called ridge regression type estimators 
(see Le Cessie and Van Houwehngen (1992) for the case of the logistic regres- 
sion when the number of covariates is large). A more general family of penalty 
functions is given by J(/?) = X]i=i 7fe^'(/^j)i '^i^h 7^ > and leads to several 
well known examples. For instance, if 7^ = 1 and '0(/3j) =| (3,j |, then J(/3) re- 
duces to the L^ penalty (see Tibshirani (199G)) while the L' penalty is obtained 
when ■0(/3j) =| /3j |'', for g > 0, see Frank and Friedman (1993). Insight on the 
choice of penalty function is given in the recent articles of Antoniadis and Fan 
(2001) and Fan and Li (2001). However, in what follows consider the penalty 
function 

d 

>/(/3) = E I /3. I', 9>1- (5) 

In particular, the ridge penalty will be employed in applications since the linear 
combination f3'h{x) is expected to vary smoothly in accordance with (1). 

3. Main results 

To fix notation, suppose that 9 = [a, (3')' denotes the maximum likelihood esti- 
mator derived by maximization of the unrestricted log-likelihood (3) and let 9^ 
denote the constrained maximum likelihood estimator obtained by maximizing 
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(4). In what follows, 6*0 denotes the true value of the parameter 9 when (1) 
holds. Define 

J~oo 1 + pi exp(a + li'h{x)) 

Mrit)= r J^^P(" + /?;M^)) Ux)dG,{x\ A2i=A,i(oo), 

J_oo 1 + Pi exp(a + P'h(x)) 

A22{t)= f -^^^^^±^^P§^M^)h'{x)dG2{x), ^22=^22(00), 
7-00 l + Piexp(a + /3'/i(a;)) 

and put 

The following lemma summarizes some useful facts regarding the large sample 
behavior of both the score function and the Hessian matrix. It is used for the 
proofs of Theorems 3.1, 3.2 and 3.3 and it is stated for completeness of the 
presentation. 

Lemma 3.1 (Qin and Zhang (1997)) Suppose that the density ratio model (1) 
holds and assume that A is nonsingular. Denote by 

a {d + \) -dimensional vector. Then 

• The score function of the unrestricted likelihood (3) is asymptotically nor- 
mal, 

n-'/^Vl{9o)^Md+iiO,V), (6) 

as n -^ 00 such that nxjni — > p\. 

• The Hessian matrix of the unrestricted likelihood (3) converges in proba- 
bility, as n s- 00 such that ni/n2 -^ pi, to 

-VW^^A.5. (7) 

The large sample behavior of the score and the Hessian matrix requires a 
number of standard regularity conditions to be satisfied. However, in the context 
of logistic regression, these conditions can be easily verified, see for instance 
Santner and Duffy (1989). Hence, the density ratio model can be applied to 
a variety of settings provided that the log likelihood ratio of two probability 
densities whose support is identical, is a sufficiently smooth function. Several 
well known examples of distributions fall within this class and therefore model 
(1) is applicable to a large collection of problems. 
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The first task is to establish existence of 9^ ^ that is whether the problem of 
maximizing (4) is well defined and if the resulting estimator is unique. When 
considering penalty functions of the form (5)-and more generally sums of smooth 
convex functions-the answer is given by the following result: 

Lemma 3.2 (Fu (1998)) For the penalty function (5) and for given q > 1 and 
Xn > there exists a unique solution of the penalized score equations 

vip{e) = vi{e) - A„vj(/?). 

The unique solution is equal to the unique estimator of the maximization problem 
maxg lp(9), provided that VZ(0) is continuously differentiable with respect to 9 
and 'S7^l{9) is positive semi definite. 

The equivalence of the density ratio model to logistic regression implies that 
both conditions of Lemma 3.2 are satisfied, or, in other words, a unique re- 
stricted estimator 9'^ exists. Summarizing, the issue of non existence of max- 
imum likelihood estimators in the context of logistic regression is successfully 
resolved by penalization. Thus, the density ratio model is applicable to situa- 
tions where the available sample size is small. Furthermore, it will be shown 
that the choice of A„ = 0{^/n) yields to y^-consistent estimators. In the case 
of A„ = o(y^) then 9^ is consistent. These conclusions are supported further 
by some limited simulation results which are reported in the next section. 

Theorem 3.1 Suppose that the density ratio model (1) holds and assume the 
regularity conditions of Lemma 3.1. Assume that the true parameter vector 9q 
lies in a compact set. Suppose further that the penalty function is given by (5) 
and \nl\/n — f Xq > 0. If n ^ oo in such a way that ni/n2 —^ pi, then there 
exists a unique maximizer 9^ of (4) such that \\ 9^ — 9o ||= Op(n~^'^). In 
particular, if Xq = 0, then 9^ is consistent. 

In addition, estimates of the unknown distribution functions are computed 
by setting 



Then 



and 



"-2 1 + p;^ exp(a^ + /3^ h{xij)) 



1=1 3 = 1 



2 Ui 

EI 

i=i j=i 






where /(.) is the indicator function. The jump sizes p^- sum up to 1 by construc- 
tion since the penalized log likelihood function (4) is employed after profiling 
out the parameters pij . 
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Remark 3.1 Identifiability of a, (3 and G2{) is guaranteed by the work of 
Gilbert et al. (1999, Th. 2) which shows that if there exist a value xq such that 
h{xo) = 0, then the density ratio model (1) is identifiable. 

Next, consider the asymptotic distribution of 6^ . It is useful to introduce the 
quantities 

6 = VJ(/?), (9) 

a d + 1 dimensional vector, and 

D = VV(/3) (10) 

a (d + 1) X (d + 1) diagonal matrix. The choice of (5) yields 

b = (0, sign(/3i)9 I /3i r-i, . . . , sign{/3d)q \ Pd r')' , 

and 

D = diag(0, qiq - 1) I /3i r^ ...,q{q-l)\f3d r')' . 

Theorem 3.2 Suppose that the conditions of Theorem 3.1 are satisfied and let 
A„ he such that Xn/ ^/n ^ Aq > 0. Then the unique maximizer 6^ tends to a 
(d + 1) -dimensional normal distribution. That is 

V^{0^ -00 + \oS-'b) ^ Afd+i{0, S), 

in distribution, as n — ^ oo, where the ((i+ 1) x (d+ 1) matrix E = S~^V S~^ is 
defined by means of (6) and (1), and b is given by (9). In particular, if Xq = 0, 
then 

xAI(^^-0o)^AAd+i(O,E). 

The above theorem, when compared to the results of Lemma 3.1, does not 
point out to any advantages of the penalization for the density ratio model. 
Indeed, under the aforementioned conditions the maximum penalized empirical 
likelihood estimator of 9 is asymptotically biased and its covariance matrix is 
equal to the covariance matrix of the unrestricted maximum empirical likelihood 
estimator. However the results is asymptotic and a careful examination of the 
proof of the theorem 3.2 shows that the asymptotic covariance matrix S is 
approximated by 

^^i+^^yVf^^i+^^V', (11) 

l + Pi n J \l + pi n ) ' ^ ' 

where A is the empirical estimator of A and D is equal to the diagonal matrix 
D evaluated at /3^, provided that the penalty function is twice diffcrentiable. 
Hence, for a judicial chosen value of A„, the mean square error of 9^ will be 
smaller than that of 9 in small samples. This point is further illustrated in 
the next section by considering some finite sample properties of the restricted 
maximum likelihood estimator, see equation (13) and subsequent discussion. 

The estimate (11) performs satisfactorily even for small sample sizes given a 
known value of the regularization parameter-see next section. For large n, and 
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if A„ — o{^/n), formula (11) reduces to that used by Qin and Zhang (1997) and 
Fokianos ct al. (2001) for the asymptotic variance estimator of the unrestricted 
maximum Ukehhood 6. 

Theorem 3.2 suggests that a standard Wald test can be used to test the 
hypothesis Hq : /3 ~ 0-that is the two samples are identically distributed. In 
particular, under the hypothesis and provided that A„ ~ o(y^), we obtain that 
the test statistic 

W - n[3^ t^^p^ (12) 

where E^j denotes the estimated asymptotic variance of /3^ which is obtained 
by means of (11). Under the null hypothesis, the asymptotic distribution of W 
is the chi-square with d degrees of freedom provided that the regularization 
parameter satisfies the aforementioned conditions. It is clear that (12) depends 
on the choice of the regularization parameter but the notation is suppressed 
for ease of presentation. A limited simulation study-see Section 4- shows that 
the chi-square approximation is valid even though the sample size is relatively 
small. 

Remark 3.2 Application of the density ratio model relies on the assumed func- 
tional form oih[.). Misspecification of this form results in biased estimators and 
loss of efficiency- the point is made by Fokianos and Kaimi (2006) who employ 
the Box-Cox family of transformations to estimate h{.). However, we have tac- 
itly assumed throughout the presentation that the function h{.) is the true 
function associated with the density ratio model (1). 

Remark 3.3 An alternative penalization scheme is based on (5) but with < 
g < 1, see Knight and Fu (2000) for example, who studied the problem in the 
linear regression setup. These penalty functions are rather appealing in the sense 
that they combine model selection and estimation. Theory regarding the density 
ratio model in connection to penalty functions of the form J(/3) when < g < 1 
might be particular useful especially for multivariate observations. In this case, 
the main task is to identify a parsimonious functional form of the log likelihood 
ratio of two, or more, multivariate probability density functions and estimate 
some of the corresponding coefficients while setting the remaining to zero. In 
other words, the problem of modeling multivariate data by the density ratio 
model in large dimensions reduces to that of a model selection and estimation 
problem. 

The last part of this section is devoted to the study of the large sample 
properties of the estimated cdf G2 given by (8). The following theorem stud- 
ies its asymptotic distributions and generalizes the corresponding theorem of 
Qin and Zhang (1997). 

Theorem 3.3 Suppose that the conditions of Theorem 3.1 are satisfied and 
let Xn be such that Xn/\/n ^ Aq > 0. Then, as n -^ 00 in such a way that 
ni/n2 -^ pi. 



{G^{t)-G2(t)+pi\o{Aii{t),Ai2{t))S-^b^^W, 



D\ 
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weakly, where W is a Gaussian process possessing continuous sample paths and 
has mean and covariance function specified by equation (15). 

A similar large sample result holds for G^ . Its proof is along the lines of the 
previous theorem and is omitted. In the next section we provide some empirical 
evidence for the finite sample performance of the estimate 0^ . 

4. Simulation study 

The above results show that the penalized density ratio model depends on both 
the choice of the regularization parameter and the selection of the penalty 
function J(/3). In the sequel we assume that J(/3) = Yli^i^i relying on the 
fact that the linear combination P'h{x) is a smooth function /3. This choice 
leads to the so-called ridge regression type estimators whose mean square error 
is less than that of the corresponding unrestricted maximum likelihood esti- 
mators, for a range of values of A„. For the ridge penalty, (see, for instance, 
Le Cessie and Van Houwclingen (1992)) consider the following approximate ex- 
pansion of the penalized likelihood function for 

wipie^) « v/p(0o) + {0^ - ea)'v%{eo). 

After rearranging terms and taking into account equations (9), (10) and Lemma 
3.1, an approximate expression for 9^ is given by 

e^ « (V2z(0o) + 2A„A.)"' (v';(0o)^o - v;(0o)) , 

where 



Dr = 




Id 



where Id denoted the d-dimcnsional identity matrix. On the other hand, a 
similar argument shows that 6'o = 6* + {V^I{6q)}^^VI{9q) . These two displays 
when combined show that 

e^ « (v2/(6'o) + 2\nD) "^ v'^i{ea)e. (13) 

Therefore 9^ shrinks towards zero as the value of the regularization parame- 
ter increases while for A„ — > 0, ^^ is approaching the unrestricted maximum 
likelihood estimator. In particular, the above representation forms the basis 
for developing an asymptotic expression of the mean square error of 9'^. The 
methodology is quite analogous to ordinary ridge regression and therefore is 
omitted (for more see Hocrl and Kcnnard (19701), a). Consequently, it can be 
shown that the mean square error of /3^ attains its minimum value at some value 
of the regularization parameter, see the lower left hand side plot of Figure 1 for 
an example. In other words, the maximum penalized empirical likelihood esti- 
mator attains smaller mean square error than that of the estimator proposed 
by Qin and Zhang (1997), especially for small samples. 
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Theoretical Quantiles 



(a) 



(b) 





Fig 1. Top: QQ-plot of the test statistic (12) under the hypothesis /3 = for A = 1. Here 
the data have generated according two lognormal populations with m = fJ,2 = 0, ai = 02 = 1 
and ni = n2 = 10. Bottom: (a) Mean square error of (i^ and (b) asymptotic efficiency of 
f) with respect to ji^ as functions of A . The data are drawn from two lognormal populations 
with /ii = 1, /i2 = CTi = (T2 = 1, and ni = n2 = 20. The quadratic penalty function has 
been used for fitting the model and the results are based on 1000 simulations. 
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Table 1 
Estimated coefficient, its mean sguare error (MSB) and power of the test statistic (12) for 

testing j3 = (nominal significance level 0. 05) for different choices of sample sizes and 

regularization parameter values. The data have been generated by a log-normal distribution 

and a quadratic penalty has been used. Results are based on 1000 simulations 



True P 


Sample Size 


A„ 


/3^ 


MSE(^^) 


Power 





m = 712 = 10 





0.0193 


0.3488 


0.021 






0.5 


0.0089 


0.2301 


0.043 






1.0 


-0.0181 


0.1821 


0.050 




711 = 10 





-0.0189 


0.1728 


0.038 




712 = 30 


0.5 


0.0070 


0.1540 


0.057 






1.0 


0.0027 


0.1138 


0.045 


1 


ni = 712 = 10 





1.4123 


9.7126 


0.402 






0.5 


1.0023 


0.2486 


0.530 






1.0 


0.8525 


0.1718 


0.550 




7X1 = 10 





1.1321 


0.3999 


0.707 




n2 = 30 


0.5 


1.0122 


0.1746 


0.750 






1.0 


0.8978 


0.1450 


0.749 



These results are empirically verified by considering the following simple ex- 
ample. Consider the case of two lognormal random samples with corresponding 
density functions 



g{x,fj.^,a^) 



1 



xa 



exp (-(logx - jUj)^/2(7^) , x>0 



for i = 1,2. The density ratio model (1) is satisfied with a = (/Xj — ftl)/2a^, 
p = (/ii — /i2)/CT^ and h{x) = log a: and consider penalized empirical likelihood 
inference based on (4) by choosing appropriately both the penalty parameter 
and the penalty function. In this limited simulation study, the parameter A„ 
was chosen as a constant satisfying therefore the conditions of Theorems 3.1 
and 3.2. Furthermore, a simple quadratic penalty function, that is J(/3) = /3^, 
was used for maximizing the penalized likelihood (4). Table 1 summarizes the 
results of 1000 runs for different sample sizes and for values of A„ . The first six 
lines of Table 1 report results when both groups of data are generated by the 
same log-normal distribution with iJ-i — ft2 — and ai = <72 = 1- The mean 
square error of the penalized estimator of /3, say f3^ is less than the mean square 
error of the unrestricted estimator even for small sample sizes. In addition, the 
nominal significance level of 5% for testing /3 = by using the test statistic 
(12) is achieved satisfactorily in all cases except that of A„ = 0. The variance 
estimate is obtained by means of (11). The last six lines of Table 1 report results 
under the alternative hypothesis. In this case the two log-normal samples were 
generated with /xi = 1, /Z2 = and ai = (72 = 1- The results are along the 
lines of the previous findings. It is interesting to observe that for ni ~ n2 = 10, 
the mean square error of the unrestricted maximum likelihood estimator is large 
when compared with the mean square error of the penalized maximum likelihood 
estimators. 

The top plot of Figure 1 shows a QQ-plot of the test static (12) under the 
hypothesis Hq : /3 = 0. Notice that the approximation is quite satisfactory 
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even though the sample sizes arc relatively small. The bottom left hand side 
plot of Figure 1 shows the mean square error of (3^ as a function of A. As the 
plot illustrates, there exists a value of A such that the mean square error is 
mininiized-recall the above discussion. The right hand side of the plot shows 
the asymptotic efficiency of /3 with respect to (3^ defined by 



MSE[/3] 

For this specific example, the graph illustrates empirically that e{(3\l3^) < 1, 
implying that the restricted estimator is more efficient than the unrestricted 
estimator. 

5. Conclusions 

The density ratio model is a semiparametric alternative to the problem of com- 
paring two, or more distributions functions. However, there are several draw- 
backs for its application especially when small samples are under consideration. 
The first part of this work dealt explicitly with the penalization approach to the 
empirical likelihood methodology. It was shown that the method is also moti- 
vated by Bayesian arguments and it is worth considering its performance to other 
settings. As a result, a new two sample methodology emerges where estimation 
of the parametric part can be carried out by a number of statistical programs 
while estimation of the non-parametric part is achieved by simple calculations. 
In particular the problem of existence of maximum empirical likelihood estima- 
tors is resolved successfully. In addition, the final estimators have smaller mean 
square error than the estimators without penalty. The test statistic-see (12)-for 
testing the equality of two distributions is obtained and the results were applied 
to simulated case control data. Application of the density ratio model depends 
on a number of assumptions, in particular the assumed functional form of model 
(1), the regularization parameter and the form of the penalty function (5). It is 
suggested to vary the regularization parameter within a predetermined range of 
values and use different functions h{x) to examine the sensitivity of the results. 
Penalization by the quadratic penalty seems to be a sensible idea for several ap- 
plications. As a final remark, we note that the problem can be generalized further 
by considering multiple biased sampling models-or semiparametric ANOVA-as 
in Fokianos et al. (2001). This will allow for semiparametric comparisons among 
several treatments and will also yield estimates of the unknown cdf. Further- 
more, it is worth studying estimation of functionals associated with the baseline 
distribution, see Zhang (2000), in the presence of a regularization parameter. 
Different penalty functions can be taken into account and this methodology 
is quite promising for semiparametric comparison of multivariate distribution 
where the true functional form of (1) is complicated. 
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Appendix 

Proof of Theorem 3.1 

Let M be a vector with || u \\— M where M is a large constant. Suppose that 
for every e > there exists Af > such that 

P| sup lp{ea + n-^''^u) < lp{0Q)} > 1 - e. (14) 

||«|| = A/ J 

That means there exists a local maximizer with probability tending to 1, denoted 
by ^^, in the ball {do + n-^^^u :|| u \\< M} such that || d^ - 9o \\^ Opin-^^^). 
Now 

lp{0o + n-'/^u)-lp{eo) = {i(eo + n-^/\) - i(eo)) 

-A„(j(/3o + n-i/2y)- J(/3o)) 

= E1+E2 

But 

2 6 

where 9* lies in the line segment connecting the points 9o and 9q + n^^/^w. 
Standard arguments show that 

I n-i/^V'K^o)" I = Op{l)\\u\l 
-nT uS/ l{9o)u = ——u'Su + Op{l), 



In^y^^' I„'X7^]lft* 



n -'-V {uV^l{9*)u}u = Op{l) 



where (6) has been used and S has been defined by (7). Similarly, we obtain 
that 

E-2 = -Xn In-'/' ^ {q\f3,rhign{f3,)u,I{P, ^ 0)) 



ln-1 ^ q{q - 1) (|/3, r'up{f3, + 0)) + o(l) ] 



Again, it can be shown that both terms of the above summand are 0(1) and 
0(1) respectively, provided that Xnj^/n ^ Aq > upon recalhng (9), (10) and 
the fact that 6*0 lies in a compact set. For large 1 1 m 1 1 all the terms are small and 
dominated by ^n~^u'V^Z(0o)'w. Because of Lemma 3.2, 9^ exists with probability 
1, hence the theorem follows. If Aq = 0, then the sequence is consistent since 9^ 
can be chosen independently of u. 



K. Fokianos/ Penalized logistic regression 578 

Proof of Theorem 3.2 
A Taylor expansion shows that 

= vip{e^) = vip{9o) + v^ip{e*){e^ - ^o) 

where 9* hes in the segment between ^g a-nd 0^ . Rearranging terms in the above 
expression we obtain 

= - {y\{Oo)Y^ E [V/p(0o)] - {^\{0*)Y^ {V/p(0o) - E [V/p(0o)]} 
- {{^\m]-' - {V%(0o)}"'}e[V/p(0o)] 

Then apphcation of the weak law of large numbers together with the central 
limit theorem shows that 

as n ^ oo , where the first and third convergence arc in probability. These results 
prove the theorem. 

Proof of Theorem 3.3 

Following similar arguments as those of Qin and Zhang (1997), the following 
representation is obtained by a Taylor expansion 

G\(t)-G2{t) + ^(AnW,^'i2W)(--V^Z(eo) + — VV(/3)1 E[VZp(0)] 
n y n n ) 

^ A{t)^G2it) + K{t)+Rn{t), 
with suptg[_^ „„] I Rn{t) h Op(n-i/2)^ 

Aft) = VT ^^^'' - *^ 

tr,'lt"2(l + Piexp(a + /3'Mx„)))' 
and 



P^ ^ A U\ M ^i^^ f lv72;//l ^ , ^n V72 



K{t) = _^(An(i),Al2(i))-- V^Z(0o) + — V^J(/3) 
n L ^ 1^ 

x(VZp(0o)-E[V/p(0o)]). 
Then, 

E[V^{A{t)-G2it)+K{t))]=0 



-1 
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and after lengthy calculations 

Cov[V^(A(t) - G2{t) + K{t)) ; V^(A(s) - 62(5) + ii:(s))] = 

(l + Pi)(G2(iAs)-G2(i)G2(s)) 
- pi(l + pi)Aii(iAs) 
+ pi(l + pi)(An(t),Al2(t))'A-i 
X (An(s),^;2(^))- (15) 

In the above derivation, we use the fact that Xn/\/n —>■ Xq so that 

|_ n n J 

where S is recalled by (7). Therefore the central limit theorem and the Cramer- 
Wold device imply that the finite dimensional distributions of the process 
^/n{A{t) — G2{t) + K{t)) converge weakly to those of a Gaussian process with 
mean and covariance function given by (15). In order to prove the conclusion 
of the theorem, it is sufficient to show {^Jn{K{t) — G2{t) + K{t)), t g [—00, 00]} 
is tight in D[—oo, oo]-a fact which can be shown by using tightness criteria. 
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